fix: correct NVML_FI_* field IDs and add runtime v12/v13U1 remapping #137

Merged: brayniac merged 5 commits into rust-nvml:main from brayniac:fix/nvml-fi-v12-v13-compat on Mar 27, 2026

@brayniac
Contributor

@brayniac brayniac commented Mar 27, 2026

Summary

  • Fixes incorrect NVML_FI_PWR_SMOOTHING_* constants that used CUDA 13.0 Update 1 numbering despite the crate declaring NVML API v12, causing silent data corruption on CUDA 12 hosts
  • Adds 5 missing field ID constants (CLOCKS_EVENT_REASON_*, POWER_SYNC_BALANCING_*) at IDs 251-255
  • Shifts PWR_SMOOTHING_* constants to their correct v12 positions (256-273)
  • Adds runtime driver version detection at Nvml::init() to transparently remap field IDs when running on v13U1+ drivers (>= 580.82)
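
The driver version check in the last bullet could look roughly like the sketch below. Only the name `detect_field_id_scheme`, the `V13Update1` scheme name, and the 580.82 threshold come from this PR. The `FieldIdScheme` enum, the `V12` variant name, the string-based signature, the `parse_major_minor` helper, and the fallback to v12 numbering for unparseable versions are assumptions made for illustration, not the merged code.

```rust
/// Which field ID numbering the loaded driver expects.
/// (Hypothetical type; only the `V13Update1` name appears in the PR's test output.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FieldIdScheme {
    V12,
    V13Update1,
}

/// Parse the leading "major.minor" of a driver version string such as "595.58.03".
/// (Hypothetical helper.)
fn parse_major_minor(version: &str) -> Option<(u32, u32)> {
    let mut parts = version.trim().split('.');
    let major: u32 = parts.next()?.parse().ok()?;
    let minor: u32 = parts.next()?.parse().ok()?;
    Some((major, minor))
}

/// Drivers at or above 580.82 use the v13U1 numbering; everything else,
/// including versions that fail to parse (an assumption), is treated as v12.
pub fn detect_field_id_scheme(driver_version: &str) -> FieldIdScheme {
    match parse_major_minor(driver_version) {
        // Tuples compare lexicographically: major first, then minor.
        Some(v) if v >= (580, 82) => FieldIdScheme::V13Update1,
        _ => FieldIdScheme::V12,
    }
}
```

Comparing `(major, minor)` tuples keeps the threshold check numeric, so a hypothetical `580.100` would still sort above `580.82`, which a plain string comparison would get wrong.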

Background

NVIDIA broke ABI compatibility for field IDs 251-273 between CUDA 13.0 and 13.0 Update 1 (driver >= 580.82). See NVIDIA's known issues. The previous constants were inadvertently taken from a v13U1 source while the crate declares NVML_API_VERSION = 12.

The remapping is transparent — callers use the canonical v12 constants and field_values_for() handles translation based on the detected driver version.
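
A minimal sketch of the translation step, derived from the V12_ID/DRIVER_ID pairs reported in the hardware test output further down this thread: the five IDs 251-255 move up by 18, and the eighteen PWR_SMOOTHING_* IDs 256-273 move down by 5. The name `translate_field_id` is from this PR; the signature and the `FieldIdScheme` enum shape are assumptions, and the merged code may express the mapping differently (for example as a lookup table).

```rust
/// Hypothetical scheme type; only the `V13Update1` name appears in the PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FieldIdScheme {
    V12,
    V13Update1,
}

/// Translate a canonical v12 field ID into the ID the loaded driver expects.
pub fn translate_field_id(id: u32, scheme: FieldIdScheme) -> u32 {
    match scheme {
        // v12 drivers already use the canonical numbering.
        FieldIdScheme::V12 => id,
        FieldIdScheme::V13Update1 => match id {
            // CLOCKS_EVENT_REASON_* and POWER_SYNC_BALANCING_*: 251-255 -> 269-273.
            251..=255 => id + 18,
            // PWR_SMOOTHING_*: 256-273 -> 251-268.
            256..=273 => id - 5,
            // IDs outside the affected range are unchanged.
            _ => id,
        },
    }
}
```

For instance, a caller asking for v12 ID 256 (PWR_SMOOTHING_ENABLED) on a v13U1 driver would have the query issued as ID 251.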

Test plan

  • Unit tests for driver version detection (detect_field_id_scheme)
  • Unit tests for field ID translation (translate_field_id) covering v12 no-op, v13U1 remapping, and passthrough for unaffected IDs
  • Integration test on CUDA 12 host to verify correct field values
  • Integration test on CUDA 13.0U1+ host to verify remapping works

Fixes #134

🤖 Generated with Claude Code

brayniac and others added 2 commits March 27, 2026 10:28
The NVML_FI_PWR_SMOOTHING_* constants were using CUDA 13.0 Update 1
numbering (starting at 251) despite the crate declaring NVML API v12.
This caused silent data corruption on CUDA 12 hosts — querying power
smoothing fields would return clock event reason data instead.

NVIDIA broke ABI compatibility for field IDs 251-273 between CUDA 13.0
and 13.0 Update 1 (driver >= 580.82). This commit:

- Fixes nvml.h and bindings.rs to use correct v12 numbering
- Adds 5 missing constants (CLOCKS_EVENT_REASON_*, POWER_SYNC_*)
- Shifts PWR_SMOOTHING constants from 251-268 to 256-273
- Detects driver version at init and transparently remaps field IDs
  when running on v13U1+ drivers (>= 580.82)

Callers are unaffected — field_values_for() handles the translation.

Fixes rust-nvml#134

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@blthayer

This looks good and seems like a solid implementation to me.

My two cents, for what it's worth: I recommend updating the test(s) for field_values_for to exercise all the NVML_FI_DEV_* fields for both v12 and v13 (you can copy + paste from my open draft PRs here and here if that's helpful). I'd also use the power of AI (or just hand-write it lol) to extend the translate_field_id_v13u1_remaps_affected_range test so it's a little more thorough than the current spot-checking. Obviously not strictly necessary, but mo' tests/assertions mo' better, right?

brayniac and others added 2 commits March 27, 2026 11:00
Expand translate_field_id_v13u1_remaps_affected_range to check every
constant in the 251-273 range by name, and verify the mapping is a
bijection (no collisions).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
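
The bijection property mentioned in this commit can be checked along the following lines. This is a sketch, not the test from the PR: `translate_v13u1` is a hypothetical stand-in for the v13U1 branch of `translate_field_id`, with offsets taken from the hardware test table in this thread, and `is_bijection_on_affected_range` is an illustrative helper name.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for the v13U1 remapping of canonical v12 IDs.
fn translate_v13u1(id: u32) -> u32 {
    match id {
        251..=255 => id + 18, // 251-255 -> 269-273
        256..=273 => id - 5,  // 256-273 -> 251-268
        _ => id,
    }
}

/// True when the remapping permutes 251..=273: every translated ID stays in
/// the range and no two source IDs collide on the same target.
fn is_bijection_on_affected_range() -> bool {
    let source: BTreeSet<u32> = (251..=273).collect();
    let image: BTreeSet<u32> = source.iter().map(|&id| translate_v13u1(id)).collect();
    // Equal sets imply 23 distinct targets, so the map is injective and onto.
    image == source
}
```

Comparing sets rather than individual pairs catches collisions that spot checks can miss, such as two constants accidentally mapped to the same driver ID.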
Queries all 23 field IDs in the affected 251-273 range against real
hardware. On a v13U1+ driver (>= 580.82), this exercises the
translate_field_id remapping path end-to-end.

CLOCKS_EVENT_REASON fields should return throttle-reason data on most
GPUs. PWR_SMOOTHING fields are Blackwell-only and expected to return
NotSupported on older architectures — getting NotSupported (rather than
wrong data) confirms the correct field was queried.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@brayniac
Contributor Author

@blthayer - good suggestions. Added some test coverage.

@thinkingfish
Contributor

thinkingfish commented Mar 27, 2026

Given that nvml.h is vendored but carries no self-identifying version number in the header itself, I think we should add a VERSION.md that points to the package the header was extracted from, e.g. https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-nvml-dev-13-2_13.2.51-1_amd64.deb

Show a table with NAME, V12_ID, DRIVER_ID, and RESULT for each field.
Run query once instead of 3x to keep output clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
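
A table with these four columns can be produced with plain `format!` width specifiers, as in the sketch below. The helper name `format_row` and the column widths are assumptions chosen for illustration; the test in this PR may format its output differently.

```rust
use std::fmt::Display;

/// Format one table row: a left-aligned name followed by right-aligned ID
/// columns and the query result. (Hypothetical helper; widths are arbitrary.)
pub fn format_row(
    name: &str,
    v12_id: impl Display,
    driver_id: impl Display,
    result: impl Display,
) -> String {
    format!("{:<52} {:>6} {:>10}  {}", name, v12_id, driver_id, result)
}
```

Taking `impl Display` lets the same helper emit the header, as in `format_row("NAME", "V12_ID", "DRIVER_ID", "RESULT")`, and the data rows, as in `format_row("POWER_SYNC_BALANCING_AF", 255, 273, "NotSupported")`.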
@brayniac
Contributor Author

Tested against an RTX 4090 running a recent driver (595.58.03); remapping appears to be working.

running 1 test
Driver: 595.58.03, scheme: V13Update1
NAME                                                 V12_ID  DRIVER_ID  RESULT
------------------------------------------------------------------------------------------
CLOCKS_EVENT_REASON_SW_THERM_SLOWDOWN                   251        269  Ok(U64(0))
CLOCKS_EVENT_REASON_HW_THERM_SLOWDOWN                   252        270  Ok(U64(0))
CLOCKS_EVENT_REASON_HW_POWER_BRAKE_SLOWDOWN             253        271  Ok(U64(0))
POWER_SYNC_BALANCING_FREQ                               254        272  NotSupported
POWER_SYNC_BALANCING_AF                                 255        273  NotSupported
PWR_SMOOTHING_ENABLED                                   256        251  NotSupported
PWR_SMOOTHING_PRIV_LVL                                  257        252  NotSupported
PWR_SMOOTHING_IMM_RAMP_DOWN_ENABLED                     258        253  NotSupported
PWR_SMOOTHING_APPLIED_TMP_CEIL                          259        254  NotSupported
PWR_SMOOTHING_APPLIED_TMP_FLOOR                         260        255  NotSupported
PWR_SMOOTHING_MAX_PERCENT_TMP_FLOOR_SETTING             261        256  NotSupported
PWR_SMOOTHING_MIN_PERCENT_TMP_FLOOR_SETTING             262        257  NotSupported
PWR_SMOOTHING_HW_CIRCUITRY_PERCENT_LIFETIME_REMAINING    263        258  NotSupported
PWR_SMOOTHING_MAX_NUM_PRESET_PROFILES                   264        259  NotSupported
PWR_SMOOTHING_PROFILE_PERCENT_TMP_FLOOR                 265        260  NotSupported
PWR_SMOOTHING_PROFILE_RAMP_UP_RATE                      266        261  NotSupported
PWR_SMOOTHING_PROFILE_RAMP_DOWN_RATE                    267        262  NotSupported
PWR_SMOOTHING_PROFILE_RAMP_DOWN_HYST_VAL                268        263  NotSupported
PWR_SMOOTHING_ACTIVE_PRESET_PROFILE                     269        264  NotSupported
PWR_SMOOTHING_ADMIN_OVERRIDE_PERCENT_TMP_FLOOR          270        265  NotSupported
PWR_SMOOTHING_ADMIN_OVERRIDE_RAMP_UP_RATE               271        266  NotSupported
PWR_SMOOTHING_ADMIN_OVERRIDE_RAMP_DOWN_RATE             272        267  NotSupported
PWR_SMOOTHING_ADMIN_OVERRIDE_RAMP_DOWN_HYST_VAL         273        268  NotSupported
test device::test::field_values_for_v12_v13u1_remapping ... ok

@brayniac brayniac merged commit 2ba1c23 into rust-nvml:main Mar 27, 2026
7 checks passed
@brayniac brayniac deleted the fix/nvml-fi-v12-v13-compat branch March 27, 2026 20:59
@blthayer blthayer mentioned this pull request Mar 27, 2026

Development

Successfully merging this pull request may close these issues.

NVML v12 vs v13 ABI incompatibility, incorrect NVML_FI_PWR_SMOOTHING_* constants in current nvml-wrapper-sys release

3 participants